Assignment 2

There is a separate script, assignment_2.py, which runs the optuna studies. In this notebook, we load their results and briefly discuss them.

Comments on the first study

These values seem strange at first glance. The paper this architecture is taken from uses 3 z-layers and achieves a much better accuracy than our roughly $0.55$. Moreover, in our first tests with this architecture, a few rounds of manual fine-tuning were enough to exceed $0.6$ accuracy. We therefore examine the results more closely below.

The plot above looks strange: trials close to the best one were apparently found very early, and no improvement followed. Next, we check how much importance each parameter was assigned.
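The "best found early, then a plateau" pattern can also be checked numerically by computing the running best accuracy over the trial sequence; a minimal sketch with made-up accuracy values:

```python
def running_best(values):
    """Best objective value seen up to each trial (for a maximization study)."""
    best, out = float("-inf"), []
    for v in values:
        best = max(best, v)
        out.append(best)
    return out

# Made-up accuracies illustrating the pattern: the best trial appears early,
# then the running best stays flat.
print(running_best([0.40, 0.55, 0.52, 0.55, 0.54]))  # [0.4, 0.55, 0.55, 0.55, 0.55]
```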

It seems that the number of z-layers was the most important parameter. We suspect that it was therefore almost always set to 1; we can quickly verify this with a histogram over all sampled values for the number of z-layers.
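Such a histogram can be computed directly from the trial records; a sketch, where "n_z_layers" is an assumed parameter name (the actual one is defined in assignment_2.py):

```python
from collections import Counter

def param_histogram(trial_params, name):
    """Count the sampled values of one hyperparameter.

    trial_params: list of parameter dicts, e.g. [t.params for t in study.trials].
    """
    return Counter(p[name] for p in trial_params if name in p)

# Toy illustration with made-up trials:
hist = param_histogram(
    [{"n_z_layers": 1}, {"n_z_layers": 1}, {"n_z_layers": 3}], "n_z_layers"
)
print(hist)
```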

Indeed, the number of z-layers was 1 in the vast majority of trials. This very important parameter seems entirely underexplored. We conduct another study, this time setting the minimum number of z-layers to 2 and enabling the experimental multivariate mode of the optuna study sampler. We load this study and perform the same checks.

Comments on the second study

These values seem more realistic, and the best trial is close to the performance that we got by manual tuning. Let's check this study, too.

Again, good values were found very quickly; optuna seems to be good at finding (local) maxima fast. However, judging from the paper we are essentially trying to reproduce, an even higher accuracy should be possible with 3 z-layers. Therefore, let's look at the number of z-layers again.

These statistics look healthier now. However, the number of automatically pruned trials, i.e. trials that did not reach a good accuracy quickly enough, is very high for this study. We suspect that pruning discarded many potentially promising 3-layer models, since they take longer to train. Next time, we will configure the pruning more carefully.

Training the model and presenting the results

Plot the learning curves for loss and accuracy.
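A minimal plotting sketch; the history dict and its keys ("loss", "val_loss", "acc", "val_acc") are assumed names for the per-epoch values recorded during training.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt

def plot_learning_curves(history):
    """history: dict of per-epoch lists with assumed keys
    'loss', 'val_loss', 'acc', 'val_acc'."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot(history["loss"], label="train")
    ax_loss.plot(history["val_loss"], label="test")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()
    ax_acc.plot(history["acc"], label="train")
    ax_acc.plot(history["val_acc"], label="test")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()
    return fig
```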

Also show the confusion matrix.
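The confusion matrix can be computed from the test-set labels and the model's predictions, for example with scikit-learn; the labels below are placeholders.

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels; in the notebook these come from the test set
# and the trained model's predictions.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: true class, columns: predicted class
```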

Final comments

The accuracy here indeed reaches 60%, but the learning curve clearly shows overfitting. Next time, we think it would be beneficial to forgo optuna's automatic pruning and instead implement a stopping criterion in the model training itself, for example something like 'early' stopping based on the test-set accuracy. Trial results for different numbers of z-layers would then be more comparable, and the hyperparameter optimization should have an easier time finding slightly better configurations. However, 60% is probably fine here, and the paper the architecture comes from might have achieved better results because we did not replicate their training process, which is more involved than what we used here.
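The stopping criterion we have in mind could look roughly like this sketch; the patience value and the accuracy-based criterion are our assumptions.

```python
def should_stop(test_accuracies, patience=5):
    """'Early' stopping on test accuracy: stop once the best epoch is more
    than `patience` epochs in the past."""
    if len(test_accuracies) <= patience:
        return False
    # No accuracy in the last `patience` epochs beat the earlier best.
    return max(test_accuracies[-patience:]) <= max(test_accuracies[:-patience])
```

In the training loop, one would append the test accuracy after each epoch and break once should_stop returns True.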

Comment from Jan on optuna: I really like the way that the parameters to be optimized are defined in the code. However, I was unable to find a way to have several processes work on the same study in a thread-safe manner without using an additional sqlite database, which seems very inconvenient to me. WandB offers the same visualizations, allows conveniently checking the study progress online with a superb git integration, and has a trivial setup process for configuring multiple agents, across several GPUs and even machines, to work on the same study. Also, having used it countless times, I have never encountered a case where 'computer science student descent' was more effective than the parameters that WandB would find after sweeping for a while. I remain somewhat unconvinced that optuna is the better option.